We solve the dynamics of on-line Hebbian learning in large perceptrons exactly, for the regime
where the size of the training set scales linearly with the number of inputs. We consider both 
noiseless and noisy teachers. Our calculation cannot be extended to non-Hebbian rules, but 
the solution provides a convenient and welcome benchmark with which to test more general 
and advanced theories for solving the dynamics of learning with restricted training sets. 
1 Introduction



Considerable progress has been made in understanding the dynamics of supervised learning in 
layered artificial neural networks through the application of the methods of statistical mechanics. 
A recent review of work in this field is contained in [1]. For the most part, such theories
have concentrated on systems where the training set is much larger than the number of weight 
updates. In such circumstances the probability that any given question will be repeated during 
the training process is negligible and it is possible to assume for large networks, via the central 
limit theorem, that their local field distribution is always Gaussian. In this paper we consider 
restricted training sets; we suppose that the size p of the training set scales linearly with N, 
the number of inputs. As a consequence the probability that a question will reappear during 
the training process is no longer negligible, the assumption that the local fields have Gaussian 
distributions is not tenable, and it is clear that correlations will develop between the weights 
and the questions in the training set as training progresses. In fact, the non-Gaussian character 
of the local fields should be a prediction of any satisfactory theory of learning with restricted 
training sets, as this is clearly demanded by numerical simulations. 

Several authors [2, 3, 4, 5, 6, 7] have discussed learning with restricted training sets, but
constructing a general theory is difficult. A simple model of learning with restricted training 
sets which can be solved exactly is therefore particularly attractive and provides a yardstick 
against which more difficult and sophisticated general theories can, in due course, be tested and 
compared. We show how this can be accomplished for on-line Hebbian learning in perceptrons 
with restricted training sets and we obtain exact solutions for the generalization error, the 
training error and the field distribution for a class of noisy teacher networks and student networks 
with arbitrary weight decay. We work out in detail the two particular but representative cases 
of output noise and Gaussian weight noise. Our theory is found to be in excellent agreement 
with numerical simulations and our predictions for the probability density of the student field 
are a striking confirmation of them, making it clear that we are indeed dealing with local fields 
which are non-Gaussian. An outline of our results is to appear in the conference proceedings [8].

2 Definitions and Explicit Microscopic Expressions 

We study on-line learning in a student perceptron S, which tries to learn a task defined by a 
noisy teacher perceptron T. The student input-output mapping is specified by a weight vector 
J according to 

$$S:\ \{-1,1\}^N \to \{-1,1\}, \qquad S(\boldsymbol{\xi}) = \mathrm{sgn}[\mathbf{J}\cdot\boldsymbol{\xi}] .$$

For a given J, this is a deterministic mapping from binary inputs to binary outputs. The teacher 
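output $T(\boldsymbol{\xi})$, on the other hand, is stochastic. In its most general form, it is determined by the
probabilities $P(T = \pm 1|\boldsymbol{\xi})$. These are related to the average teacher output $\bar{T}(\boldsymbol{\xi})$ for a given
input $\boldsymbol{\xi}$ by

$$P(T = \pm 1|\boldsymbol{\xi}) = \tfrac{1}{2}[1 \pm \bar{T}(\boldsymbol{\xi})], \quad \text{or} \quad P(T|\boldsymbol{\xi}) = \tfrac{1}{2}[1 + T\,\bar{T}(\boldsymbol{\xi})]. \qquad (1)$$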

To ensure that this noisy teacher mapping can be thought of as the corrupted output of an 
underlying 'clean' perceptron with weights B* , we make the mild assumption that the average 
teacher output can be written in the form 

$$\bar{T}(\boldsymbol{\xi}) = \tau(y), \qquad y = \mathbf{B}^*\cdot\boldsymbol{\xi} \qquad (2)$$

with some function $\tau(y)$. In other words, the noise process preserves, on average, the perceptron
structure of the teacher. The uncorrupted teacher weight vector is taken to be normalized such
that $(\mathbf{B}^*)^2 = 1$, with each component $B_i^*$ of $O(N^{-1/2})$. We also assume that inputs are sampled
randomly from a uniform distribution¹ on $\{-1,1\}^N$. Typical values of the (uncorrupted) 'teacher
field' $y$ are then of $O(1)$; in the thermodynamic limit $N \to \infty$ that we will be interested in, $y$ is
Gaussian with zero mean and unit variance.

The class of noise processes allowed by (2) is quite large and includes the standard cases of
output noise and Gaussian weight noise that are often discussed in the literature. For output
noise, the sign of the clean teacher output $\mathrm{sgn}(y)$ is inverted with probability $\lambda$, i.e.,

$$P(T|\boldsymbol{\xi}) = (1-\lambda)\,\theta(Ty) + \lambda\,\theta(-Ty), \qquad \tau(y) = (1-2\lambda)\,\mathrm{sgn}(y). \qquad (3)$$

For Gaussian weight noise, the teacher output is produced from a corrupted teacher weight
vector $\mathbf{B}$. The corrupted weights $\mathbf{B}$ differ from $\mathbf{B}^*$ by the addition of Gaussian noise of standard
deviation $\Sigma/\sqrt{N}$ to each component, i.e.,



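$$P(\mathbf{B}) = \left(\frac{N}{2\pi\Sigma^2}\right)^{N/2} \exp\left[-\frac{N}{2\Sigma^2}(\mathbf{B}-\mathbf{B}^*)^2\right]. \qquad (4)$$

The scaling with $N$ here is chosen to get a sensible result in the thermodynamic limit (corrupted
and clean weights clearly need to be of the same order). The corrupted teacher field is then
$z = \mathbf{B}\cdot\boldsymbol{\xi} = y + \Delta$, with $\Delta$ a Gaussian random variable with zero mean and variance $\Sigma^2$, and
hence

$$\tau(y) = \langle \mathrm{sgn}(y+\Delta)\rangle_{\Delta} = \mathrm{erf}\left(\frac{y}{\sqrt{2}\,\Sigma}\right). \qquad (5)$$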

In the numerical examples presented later, we will focus on the above two noise models. But our 
analytical treatment applies to any teacher that is compatible with the assumption (2). This covers,
for example, the more complex cases of 'reversed wedge' teachers (where $\tau(y) = \mathrm{sgn}(y)$ for
$|y| > d$ and $\tau(y) = -\mathrm{sgn}(y)$ otherwise, $d$ being the wedge 'thickness') and noisy generalizations
of these.

Our learning rule will be the on-line Hebbian rule, i.e. 

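$$\mathbf{J}(\ell+1) = \left(1 - \frac{\gamma}{N}\right)\mathbf{J}(\ell) + \frac{\eta}{N}\,\boldsymbol{\xi}^{\mu(\ell)}\,T^{\mu(\ell)} \qquad (6)$$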

where the non-negative parameters $\gamma$ and $\eta$ are the weight decay and the learning rate, respectively.
Learning starts from an initial set of student weights $\mathbf{J}_0 = \mathbf{J}(0)$, for which we assume
(as for the teacher weights) that $J_i(0) = O(N^{-1/2})$. At each iteration step $\ell$ a training example,
comprising an input vector $\boldsymbol{\xi}^{\mu(\ell)}$ and the corresponding teacher output $T^{\mu(\ell)}$, is picked at
random (with replacement) from the training set $D$. This training set consists of $p = \alpha N$ examples,
$D = \{(\boldsymbol{\xi}^{\mu}, T^{\mu}),\ \mu = 1 \ldots p\}$; it remains unchanged throughout the learning process.
Each training input vector $\boldsymbol{\xi}^{\mu}$ is assumed to be randomly drawn from $\{-1,1\}^N$ (independently
of other training inputs, and of $\mathbf{J}_0$ and $\mathbf{B}^*$), and the output $T^{\mu} = T(\boldsymbol{\xi}^{\mu})$ provided by the noisy
teacher. We call this kind of scenario 'consistent noise': to each training input corresponds a
single output value which is produced by the teacher once and for all before learning begins; the
teacher is not asked to produce new noisy outputs each time a training input is selected for a
weight update.

¹ This choice of input distribution is not critical. In fact, any other distribution with $\langle \xi_i \rangle = 0$ and
$\langle \xi_i \xi_j \rangle = \delta_{ij}$ will give results identical to the ones for the present case in the limit $N \to \infty$.
Examples would be real-valued inputs with either a Gaussian distribution with zero mean and unit
variance, or a uniform distribution over the hypersphere $\boldsymbol{\xi}^2 = N$. Likewise, we only actually require
that assumption (2) should hold with probability one (i.e., for almost all inputs) in the limit $N \to \infty$.
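The learning scenario just described can be sketched as a small simulation. The following is only an illustration under stated assumptions, not the code used for the figures of this paper: the function names, the default parameter values, the Gaussian choice for the teacher and initial student weights, and the use of output noise of strength `lam` are all choices made here for concreteness.

```python
import math
import random


def simulate_hebbian(N=400, alpha=2.0, eta=1.0, gamma=0.5, lam=0.2,
                     t_max=2.0, seed=0):
    """On-line Hebbian learning, rule (6), from a fixed training set.

    Hypothetical helper: output noise with flip probability lam is assumed.
    Returns the scalar observables Q = J^2 and R = B*.J after t_max*N steps.
    """
    rng = random.Random(seed)
    # Clean teacher B*, normalized so that (B*)^2 = 1.
    B = [rng.gauss(0.0, 1.0) for _ in range(N)]
    norm = math.sqrt(sum(b * b for b in B))
    B = [b / norm for b in B]
    # Fixed training set of p = alpha*N binary inputs. The noisy outputs are
    # drawn once, before learning starts ('consistent noise').
    p = int(alpha * N)
    inputs = [[rng.choice((-1, 1)) for _ in range(N)] for _ in range(p)]
    outputs = []
    for xi in inputs:
        y = sum(b * x for b, x in zip(B, xi))
        clean = 1 if y >= 0 else -1
        outputs.append(-clean if rng.random() < lam else clean)
    # Initial student weights with components of order N^(-1/2).
    J = [rng.gauss(0.0, 1.0) / math.sqrt(N) for _ in range(N)]
    decay = 1.0 - gamma / N
    for _ in range(int(t_max * N)):
        mu = rng.randrange(p)  # pick an example at random, with replacement
        xi, T = inputs[mu], outputs[mu]
        J = [decay * j + (eta / N) * x * T for j, x in zip(J, xi)]
    Q = sum(j * j for j in J)
    R = sum(j * b for j, b in zip(J, B))
    return Q, R


def generalization_error(Q, R):
    """Error with respect to the clean teacher, E_g = arccos(R/sqrt(Q))/pi."""
    return math.acos(R / math.sqrt(Q)) / math.pi


if __name__ == "__main__":
    Q, R = simulate_hebbian()
    print(Q, R, generalization_error(Q, R))
```

Because the training set is fixed while examples are drawn with replacement, each example is revisited a number of times of order one during a learning time $t$ of order one, which is the regime studied here.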

There are two sources of randomness in the above scenario. First of all there is the random 
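realization of the 'path' $\Omega = \{\mu(0), \mu(1), \ldots\}$. This is simply the dynamic randomness
of the stochastic process that gives the evolution of the vector $\mathbf{J}$; it arises from the random
selection of examples from the training set. Averages over this process will be denoted as $\langle \ldots \rangle$.
Secondly there is the randomness in the composition of the training set. We will write averages
over all training sets as $\langle \ldots \rangle_{\rm sets}$. We note that

$$\langle f(\boldsymbol{\xi}^{\mu(\ell)}, T^{\mu(\ell)}) \rangle = \frac{1}{p}\sum_{\mu=1}^{p} f(\boldsymbol{\xi}^{\mu}, T^{\mu}) \qquad (\text{for all } \ell)$$

and that averages over all possible realizations of the training set are given by

$$\langle f[(\boldsymbol{\xi}^1,T^1),\ldots,(\boldsymbol{\xi}^p,T^p)] \rangle_{\rm sets} = 2^{-Np} \sum_{\boldsymbol{\xi}^1}\sum_{\boldsymbol{\xi}^2}\cdots\sum_{\boldsymbol{\xi}^p}\ \sum_{T^1,\ldots,T^p=\pm 1} \left[\prod_{\mu=1}^{p} P(T^{\mu}|\boldsymbol{\xi}^{\mu})\right] f[(\boldsymbol{\xi}^1,T^1),\ldots,(\boldsymbol{\xi}^p,T^p)]$$

where $\boldsymbol{\xi}^{\mu} \in \{-1,1\}^N$.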

Our aim is to evaluate the performance of the on-line Hebbian learning rule (6) as a function
of the number of training steps $m$. This calculation becomes tractable in the thermodynamic
limit $N \to \infty$; the appropriate time variable in this limit is $t = m/N$. Basic quantities of interest
are the generalization error and the training error. The generalization error, which we choose
to measure with respect to the clean teacher, is the probability of student and (clean) teacher
producing different outputs on a randomly chosen test input. Hence
$E_g = \langle \theta[-(\mathbf{J}\cdot\boldsymbol{\xi})(\mathbf{B}^*\cdot\boldsymbol{\xi})] \rangle_{\boldsymbol{\xi}}$, with the usual result

$$E_g = \frac{1}{\pi}\arccos\left(\frac{R}{\sqrt{Q}}\right). \qquad (7)$$

Here $Q = \mathbf{J}^2$ is the squared length of the student weight vector, and $R = \mathbf{B}^*\cdot\mathbf{J}$ its overlap with
the teacher weights. These are our basic scalar observables. The training error $E_t$ is the fraction
of errors that the student makes on the training set, i.e., the fraction of training outputs that
are predicted incorrectly. It is given by

$$E_t = \sum_{T=\pm 1}\int dx\ P(x,T)\,\theta(-Tx)$$

where $P(x,T)$ is the joint distribution of the student fields $x = \mathbf{J}\cdot\boldsymbol{\xi}$ and the teacher outputs
$T$ over the training set. Because the teacher outputs depend on the teacher fields, according
to $P(T|y) = \tfrac{1}{2}[1 + T\tau(y)]$, it is useful to include the latter and to calculate the distribution
$P(x,y,T)$; we will see later that this also leads to a rather transparent form of the result.
Formally, the joint field/output distribution is defined in the obvious way,

$$P(x,y,T) = \frac{1}{p}\sum_{\mu=1}^{p} \delta(x - \mathbf{J}\cdot\boldsymbol{\xi}^{\mu})\,\delta(y - \mathbf{B}^*\cdot\boldsymbol{\xi}^{\mu})\,\delta_{T,T^{\mu}}. \qquad (8)$$
For infinitely large systems, $N \to \infty$, one can prove that the fluctuations in mean-field
observables such as $\{Q, R, P(x,y,T)\}$, due to the randomness in the dynamics, will vanish [?].
Furthermore one assumes, with convincing support from numerical simulations, that for $N \to \infty$
the evolution of such observables, when observed for different random realizations of the training
set, will be reproducible (i.e., the sample-to-sample fluctuations will also vanish, which is called
'self-averaging'). Both properties are central ingredients of all current theories. We are thus
led to the introduction of averages of our observables, both with respect to the dynamical
randomness and with respect to the randomness in the training set (always to be carried out in
precisely this order):

$$Q(t) = \lim_{N\to\infty} \langle\langle Q \rangle\rangle_{\rm sets}, \qquad R(t) = \lim_{N\to\infty} \langle\langle R \rangle\rangle_{\rm sets} \qquad (9)$$

$$P_t(x,y,T) = \lim_{N\to\infty} \langle\langle P(x,y,T) \rangle\rangle_{\rm sets} \qquad (10)$$

The large-$N$ limits here are taken at constant $t$ and $\alpha$, i.e., with the number of weight updates
and the number of training examples scaling as $m = Nt$ and $p = N\alpha$, respectively.

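Iterating the learning rule (6), we find an explicit expression for the student weight vector
after $m$ training steps:

$$\mathbf{J}(m) = \sigma^{m}\mathbf{J}_0 + \frac{\eta}{N}\sum_{\ell=0}^{m-1}\sigma^{m-1-\ell}\,\boldsymbol{\xi}^{\mu(\ell)}\,T^{\mu(\ell)}, \qquad \sigma = 1 - \frac{\gamma}{N}. \qquad (11)$$

Eq. (11) will be the natural starting point for our calculation. We will also frequently encounter
averages of the form $\langle (\mathbf{v}\cdot\boldsymbol{\xi})\,T(\boldsymbol{\xi}) \rangle_{\boldsymbol{\xi},T}$, which we now calculate. The average over $T$ is trivial
and, using assumption (2), gives $\langle (\mathbf{v}\cdot\boldsymbol{\xi})\,\tau(\mathbf{B}^*\cdot\boldsymbol{\xi}) \rangle_{\boldsymbol{\xi}}$. Provided all components of the vector $\mathbf{v}$
are of the same order, $v = \mathbf{v}\cdot\boldsymbol{\xi}$ and $y = \mathbf{B}^*\cdot\boldsymbol{\xi}$ become zero mean Gaussian variables for $N \to \infty$
with $\langle vy \rangle = \mathbf{v}\cdot\mathbf{B}^*$ and $\langle y^2 \rangle = (\mathbf{B}^*)^2 = 1$. By averaging over $v$ first for fixed $y$, we thus
obtain the desired result

$$\langle (\mathbf{v}\cdot\boldsymbol{\xi})\,T(\boldsymbol{\xi}) \rangle_{\boldsymbol{\xi},T} = \rho\ \mathbf{v}\cdot\mathbf{B}^* \qquad (12)$$

where

$$\rho = \langle y\,\tau(y) \rangle = \int Dy\ y\,\tau(y)$$

with the familiar short-hand $Dy = (2\pi)^{-1/2} e^{-y^2/2}\,dy$. Using (3) and (5), one thus finds for the
proportionality constant $\rho$

$$\rho = \sqrt{\frac{2}{\pi}}\,(1-2\lambda) \qquad \text{(output noise)} \qquad (13)$$

$$\rho = \sqrt{\frac{2}{\pi}}\,\frac{1}{\sqrt{1+\Sigma^2}} \qquad \text{(Gaussian weight noise)} \qquad (14)$$

for the two noise models that we will consider in some detail.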
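As a quick numerical sanity check of (13) and (14), the Gaussian average $\rho = \int Dy\, y\,\tau(y)$ can be evaluated by simple quadrature. The sketch below is illustrative only; the helper names and the trapezoidal integration grid are assumptions made here, not part of the original calculation.

```python
import math


def gauss_average(f, y_max=8.0, n=20000):
    """Approximate <f(y)> = int Dy f(y) with the trapezoidal rule on [-y_max, y_max]."""
    h = 2.0 * y_max / n
    total = 0.0
    for k in range(n + 1):
        y = -y_max + k * h
        weight = 0.5 if k in (0, n) else 1.0
        total += weight * f(y) * math.exp(-0.5 * y * y)
    return total * h / math.sqrt(2.0 * math.pi)


def tau_output_noise(lam):
    """tau(y) = (1 - 2*lam) * sgn(y), eq. (3)."""
    def tau(y):
        sign = 1.0 if y > 0 else (-1.0 if y < 0 else 0.0)
        return (1.0 - 2.0 * lam) * sign
    return tau


def tau_weight_noise(sigma):
    """tau(y) = erf(y / (sqrt(2) * Sigma)), eq. (5)."""
    return lambda y: math.erf(y / (math.sqrt(2.0) * sigma))


def rho(tau):
    """rho = <y tau(y)> for a teacher characterized by tau."""
    return gauss_average(lambda y: y * tau(y))


if __name__ == "__main__":
    print(rho(tau_output_noise(0.2)), math.sqrt(2 / math.pi) * 0.6)
    print(rho(tau_weight_noise(1.0)), math.sqrt(2 / math.pi) / math.sqrt(2.0))
```

The same `rho` helper can be applied to any other $\tau(y)$ compatible with assumption (2), for instance a reversed wedge.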
3 Simple Scalar Observables

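It is a simple matter to calculate the values of $Q$ and $R$ after $m$ learning steps, using (11). For
$Q$, we find

$$Q = \sigma^{2m}Q_0 + \frac{2\eta}{N}\,\sigma^{m}\sum_{\ell=0}^{m-1}\sigma^{m-1-\ell}\,\mathbf{J}_0\cdot\boldsymbol{\xi}^{\mu(\ell)}\,T^{\mu(\ell)} + \frac{\eta^2}{N^2}\sum_{\ell,\ell'=0}^{m-1}\sigma^{2m-2-\ell-\ell'}\,\boldsymbol{\xi}^{\mu(\ell)}\cdot\boldsymbol{\xi}^{\mu(\ell')}\,T^{\mu(\ell)}\,T^{\mu(\ell')} .$$

We now average both with respect to dynamical (or path) randomness and with respect to the
randomness in the training set, and take the limit $N \to \infty$ at constant learning time $t = m/N$
(see (9)). Separating out the terms with $\ell = \ell'$ from the double sum, and using (12), we obtain

$$Q(t) = e^{-2\gamma t}Q_0 + 2\eta\rho R_0\, e^{-\gamma t}\,\frac{1-e^{-\gamma t}}{\gamma} + \eta^2\,\frac{1-e^{-2\gamma t}}{2\gamma} + \lim_{N\to\infty}\frac{\eta^2}{N^2}\sum_{\ell\neq\ell'}\sigma^{2tN-\ell-\ell'}\,\langle\langle \boldsymbol{\xi}^{\mu(\ell)}\cdot\boldsymbol{\xi}^{\mu(\ell')}\,T^{\mu(\ell)}\,T^{\mu(\ell')} \rangle\rangle_{\rm sets} .$$

Here $Q_0 = \mathbf{J}_0^2$ and $R_0 = \mathbf{J}_0\cdot\mathbf{B}^*$ are the squared length and overlap of the initial student
weights, respectively. After averaging over the dynamical randomness, the average in the last
term becomes $(1/p^2)\sum_{\mu,\nu=1}^{p} \langle \boldsymbol{\xi}^{\mu}\cdot\boldsymbol{\xi}^{\nu}\,T^{\mu}T^{\nu} \rangle_{\rm sets}$. The terms with $\mu = \nu$ each contribute
$(\boldsymbol{\xi}^{\mu})^2 = N$ to this sum; the others make a contribution of $\rho^2$ each, as one finds by applying (12) twice.
Assembling everything, we have

$$Q(t) = e^{-2\gamma t}Q_0 + \frac{2\eta\rho R_0}{\gamma}\,e^{-\gamma t}(1-e^{-\gamma t}) + \frac{\eta^2}{2\gamma}(1-e^{-2\gamma t}) + \frac{\eta^2}{\gamma^2}\left(\frac{1}{\alpha}+\rho^2\right)(1-e^{-\gamma t})^2 \qquad (15)$$

where $\rho$ is given by equations (13, 14) in the examples of output noise and Gaussian weight
noise, respectively, and more generally by (12). In a similar manner we find that

$$R(t) = \lim_{N\to\infty}\left[\sigma^{tN}R_0 + \frac{\eta}{N}\sum_{\ell=0}^{tN}\sigma^{tN-\ell}\,\langle\langle \mathbf{B}^*\cdot\boldsymbol{\xi}^{\mu(\ell)}\,T^{\mu(\ell)} \rangle\rangle_{\rm sets}\right] = e^{-\gamma t}R_0 + \frac{\eta\rho}{\gamma}(1-e^{-\gamma t}). \qquad (16)$$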
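The closed-form expressions (15) and (16), together with (7), are easy to evaluate directly. The following minimal sketch does so; the function names and argument order are choices made here for illustration, and $\gamma > 0$ is assumed so that the divisions are well defined.

```python
import math


def R_of_t(t, R0, eta, gamma, rho):
    """Teacher-student overlap R(t), eq. (16); assumes gamma > 0."""
    decay = math.exp(-gamma * t)
    return decay * R0 + eta * rho * (1.0 - decay) / gamma


def Q_of_t(t, Q0, R0, eta, gamma, rho, alpha):
    """Squared student length Q(t), eq. (15); assumes gamma > 0."""
    decay = math.exp(-gamma * t)
    return (decay ** 2 * Q0
            + 2.0 * eta * rho * R0 * decay * (1.0 - decay) / gamma
            + eta ** 2 * (1.0 - decay ** 2) / (2.0 * gamma)
            + (eta / gamma) ** 2 * (1.0 / alpha + rho ** 2) * (1.0 - decay) ** 2)


def gen_error(Q, R):
    """Generalization error with respect to the clean teacher, eq. (7)."""
    return math.acos(R / math.sqrt(Q)) / math.pi


if __name__ == "__main__":
    rho = math.sqrt(2.0 / math.pi) * (1.0 - 2.0 * 0.2)  # output noise, lambda = 0.2
    for t in (0.0, 1.0, 2.0, 4.0, 8.0):
        Q = Q_of_t(t, 1.0, 0.0, 1.0, 0.5, rho, 2.0)
        R = R_of_t(t, 0.0, 1.0, 0.5, rho)
        print(t, Q, R, gen_error(Q, R))
```

For long times the ratio $R/\sqrt{Q}$ computed this way tends to $\rho/\sqrt{\rho^2 + 1/\alpha + \gamma/2}$, independently of $Q_0$ and $R_0$.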

We note in passing that our calculations easily generalize to the case of a variable learning rate
$\eta(t)$. Sums such as $\frac{\eta}{N}\sum_{\ell=0}^{tN}\sigma^{tN-\ell}$ would simply be replaced by $\frac{1}{N}\sum_{\ell=0}^{tN}\sigma^{tN-\ell}\,\eta(\ell/N)$. Using
$\sigma^{tN-\ell} = (1-\gamma/N)^{tN-\ell} = \exp[-\gamma t + \gamma\ell/N + O(1/N)]$ we see that

$$\lim_{N\to\infty}\frac{1}{N}\sum_{\ell=0}^{tN}\sigma^{tN-\ell}\,\eta(\ell/N) = \int_0^t ds\ e^{-\gamma(t-s)}\,\eta(s)$$

which reduces to the familiar result in the case when $\eta$ is constant. Other sums involving a
variable learning rate can be treated in similar fashion.
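The convergence of the discrete sum to the time integral can be checked numerically. The sketch below is an illustration only: the learning-rate schedule $\eta(s) = 1/(1+s)$ and the helper names are hypothetical choices made here, not taken from the paper.

```python
import math


def discrete_sum(eta_fn, t, gamma, N):
    """(1/N) * sum_{l=0}^{tN} sigma^(tN - l) * eta(l/N), with sigma = 1 - gamma/N."""
    sigma = 1.0 - gamma / N
    m = int(round(t * N))
    return sum(sigma ** (m - l) * eta_fn(l / N) for l in range(m + 1)) / N


def time_integral(eta_fn, t, gamma, n=20000):
    """Midpoint-rule estimate of int_0^t ds exp(-gamma*(t - s)) * eta(s)."""
    h = t / n
    total = 0.0
    for k in range(n):
        s = (k + 0.5) * h
        total += math.exp(-gamma * (t - s)) * eta_fn(s)
    return total * h


if __name__ == "__main__":
    schedule = lambda s: 1.0 / (1.0 + s)  # hypothetical decaying learning rate
    for N in (100, 1000, 10000):
        print(N, discrete_sum(schedule, 2.0, 0.5, N), time_integral(schedule, 2.0, 0.5))
```

For a constant learning rate the integral reduces to $\eta(1-e^{-\gamma t})/\gamma$, the factor that appears in (16).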

The generalization error follows directly from the above results and (7); its asymptotic value
is

$$\lim_{t\to\infty} E_g(t) = \frac{1}{\pi}\arccos\left(\frac{\rho}{\sqrt{\rho^2 + 1/\alpha + \gamma/2}}\right). \qquad (17)$$
Figure 1: Relation between the real output error probability $\lambda_{\rm real}$ and the effective output error
probability $\lambda_{\rm eff}$ for noisy teachers. For output noise the two are identical. One observes that,
for the same 'real' noise level, weight noise is significantly less disruptive to the learning process
than output noise.



More generally, one sees from (15, 16) that (for $N \to \infty$) all noisy teachers with the same $\rho$ will
give the same generalization error at any time $t$. This is true, in particular, of output noise
and Gaussian weight noise when their respective parameters $\lambda$ and $\Sigma$ are related by
$1 - 2\lambda = (1+\Sigma^2)^{-1/2}$. More generally, one can use (13) to associate, with any type of teacher noise obeying
our basic assumption (2), an effective output noise parameter $\lambda_{\rm eff}$ given by

$$1 - 2\lambda_{\rm eff} = \sqrt{\frac{\pi}{2}}\,\rho = \sqrt{\frac{\pi}{2}}\,\langle y\,\tau(y) \rangle . \qquad (18)$$

Note, however, that this effective teacher error probability $\lambda_{\rm eff}$ will in general not be identical
to the real teacher error probability $\lambda_{\rm real}$. The latter is defined as the probability of an incorrect
teacher output for a random input, $\lambda_{\rm real} = \langle P(T = -\mathrm{sgn}(\mathbf{B}^*\cdot\boldsymbol{\xi})\,|\,\boldsymbol{\xi}) \rangle_{\boldsymbol{\xi}}$. Using (1) this can be
rewritten as $\lambda_{\rm real} = \tfrac{1}{2}\langle 1 - \mathrm{sgn}(\mathbf{B}^*\cdot\boldsymbol{\xi})\,\bar{T}(\boldsymbol{\xi}) \rangle_{\boldsymbol{\xi}}$, and with (2) one obtains

$$1 - 2\lambda_{\rm real} = \langle \mathrm{sgn}(y)\,\tau(y) \rangle . \qquad (19)$$

Comparing with (18), one sees that in the effective error probability that is relevant to our
Hebbian learning process, errors for inputs with large teacher fields $y$ are weighted more heavily
than in the real error probability. For output noise, this is irrelevant because the probability
of an incorrect teacher output is independent of $y$, and $\lambda_{\rm real}$ and $\lambda_{\rm eff}$ are therefore identical. For
Gaussian weight noise, on the other hand, errors are most likely to occur near the decision
boundary of the teacher ($y = 0$). These are suppressed by the weighting in the effective error
probability, and so $\lambda_{\rm eff} < \lambda_{\rm real}$. Explicitly, one finds in this case $\lambda_{\rm real} = \frac{1}{\pi}\arctan\Sigma$, and from (14),
$\lambda_{\rm eff} = \tfrac{1}{2}[1 - (1+\Sigma^2)^{-1/2}]$; the relation between effective and real error probabilities for Gaussian
weight noise (see figure 1) is therefore

$$\lambda_{\rm eff} = \tfrac{1}{2}[1 - \cos(\pi\lambda_{\rm real})] = \sin^2(\pi\lambda_{\rm real}/2).$$
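The relation between the two error probabilities for Gaussian weight noise can be verified numerically. This is a minimal sketch; the function names are introduced here for illustration, and the formulas coded are $\lambda_{\rm real} = \arctan(\Sigma)/\pi$ and $\lambda_{\rm eff} = \tfrac{1}{2}[1-(1+\Sigma^2)^{-1/2}]$ as quoted above.

```python
import math


def lambda_real(sigma):
    """Real teacher error probability for Gaussian weight noise of strength Sigma."""
    return math.atan(sigma) / math.pi


def lambda_eff(sigma):
    """Effective output noise level for Gaussian weight noise, from (14) and (18)."""
    return 0.5 * (1.0 - 1.0 / math.sqrt(1.0 + sigma ** 2))


if __name__ == "__main__":
    for sigma in (0.1, 0.5, 1.0, 2.0, 10.0):
        lr, le = lambda_real(sigma), lambda_eff(sigma)
        # The last column should reproduce lambda_eff via sin^2(pi*lambda_real/2).
        print(sigma, lr, le, math.sin(math.pi * lr / 2.0) ** 2)
```

For every finite $\Sigma > 0$ the effective error probability lies strictly below the real one, in line with the statement that weight noise is less disruptive than output noise at the same real noise level.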
We now briefly consider whether the generalization error $E_g(t)$ can have a minimum at a
finite time $t$, i.e., whether overtraining can occur in our problem. After a straightforward but
tedious calculation we find that $E_g(t)$ as given by (7, 15, 16) is stationary at the point $t = t^*$,
where

$$t^* = \frac{1}{\gamma}\log\left[\frac{\eta\rho - 2\gamma\eta^{-1}\rho\,Q_0\sin^2(\pi E_{g,0}) - 2Q_0^{1/2}\alpha^{-1}\cos(\pi E_{g,0})}{\eta\rho - Q_0^{1/2}(2/\alpha + \gamma)\cos(\pi E_{g,0})}\right]. \qquad (20)$$

Here $E_{g,0} = E_g(0)$ is the initial generalization performance of the student. It turns out that
$E_g(t)$ has a minimum at $t^*$ if the numerator of the logarithm in equation (20) is negative. Of
course, $t^*$ must be real and positive, which demands that the denominator of the logarithmic
term in (20) be negative, and that the numerator be less than the denominator. This implies
that $E_g(t)$ will have a minimum at $t^*$ if

$$\alpha < \frac{2Q_0^{1/2}\cos(\pi E_{g,0})}{\eta\rho - \gamma Q_0^{1/2}\cos(\pi E_{g,0})} \qquad \text{and} \qquad \eta\cos(\pi E_{g,0}) < 2Q_0^{1/2}\rho\,\sin^2(\pi E_{g,0}). \qquad (21)$$

In writing (21) we have made the reasonable assumption that $E_{g,0} \in [0, \tfrac{1}{2}]$, corresponding to
an initial performance no worse than random guessing. When the conditions (21) are satisfied
the generalization error has a minimum at $t^*$ and overtraining occurs for $t > t^*$. However, in
practice this phenomenon does not appear to be of great significance, with typically $t^* < 1$.
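The stationarity condition can be checked numerically by evaluating $E_g(t)$ from (7), (15) and (16) around the predicted $t^*$. The sketch below uses the form $e^{\gamma t^*} = [\eta\rho - 2\gamma\eta^{-1}\rho Q_0\sin^2(\pi E_{g,0}) - 2Q_0^{1/2}\alpha^{-1}\cos(\pi E_{g,0})]\,/\,[\eta\rho - Q_0^{1/2}(2/\alpha+\gamma)\cos(\pi E_{g,0})]$, obtained here by setting the time derivative of $R^2/Q$ to zero; the parameter values and function names are hypothetical and serve only as an illustration.

```python
import math


def gen_error_at(t, Q0, Eg0, eta, gamma, rho, alpha):
    """E_g(t) from eqs. (7), (15), (16), with R0 fixed by Q0 and E_g(0)."""
    R0 = math.sqrt(Q0) * math.cos(math.pi * Eg0)
    u = math.exp(-gamma * t)
    R = u * R0 + eta * rho * (1.0 - u) / gamma
    Q = (u * u * Q0
         + 2.0 * eta * rho * R0 * u * (1.0 - u) / gamma
         + eta ** 2 * (1.0 - u * u) / (2.0 * gamma)
         + (eta / gamma) ** 2 * (1.0 / alpha + rho ** 2) * (1.0 - u) ** 2)
    return math.acos(R / math.sqrt(Q)) / math.pi


def t_star(Q0, Eg0, eta, gamma, rho, alpha):
    """Stationary point of E_g(t); only meaningful when the ratio exceeds one."""
    c = math.cos(math.pi * Eg0)
    s2 = math.sin(math.pi * Eg0) ** 2
    root_q0 = math.sqrt(Q0)
    numerator = eta * rho - 2.0 * gamma * rho * Q0 * s2 / eta - 2.0 * root_q0 * c / alpha
    denominator = eta * rho - root_q0 * (2.0 / alpha + gamma) * c
    return math.log(numerator / denominator) / gamma


if __name__ == "__main__":
    # Hypothetical example: clean teacher (rho = sqrt(2/pi)), small training set.
    params = dict(Q0=1.0, Eg0=0.4, eta=1.0, gamma=0.5,
                  rho=math.sqrt(2.0 / math.pi), alpha=0.5)
    ts = t_star(**params)
    print(ts, gen_error_at(ts, **params), gen_error_at(50.0, **params))
```

In this example the minimum of $E_g$ is only slightly below its asymptotic value, which fits the remark that the effect is of limited practical significance.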



4 Joint Field Distribution 



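The calculation of the average of the joint field distribution starting from equation (10) is more
difficult than that of the scalar observables. It is convenient to work in terms of the characteristic
function

$$\hat{P}_t(\hat{x},\hat{y},\hat{T}) = \sum_{T=\pm 1}\int dx\,dy\ e^{-i(\hat{x}x + \hat{y}y + \hat{T}T)}\,P_t(x,y,T). \qquad (22)$$

Using equations (8, 10, 11), we then find that

$$\hat{P}_t(\hat{x},\hat{y},\hat{T}) = \lim_{N\to\infty}\left\langle \frac{1}{p}\sum_{\mu=1}^{p}\exp\left[-i\left(\hat{x}\sigma^{tN}\mathbf{J}_0\cdot\boldsymbol{\xi}^{\mu} + \hat{y}\,\mathbf{B}^*\cdot\boldsymbol{\xi}^{\mu} + \hat{T}T^{\mu}\right)\right] \times \left\langle \exp\left[-\frac{i\eta\hat{x}}{N}\sum_{\ell=0}^{tN}\sigma^{tN-\ell}\,\boldsymbol{\xi}^{\mu}\cdot\boldsymbol{\xi}^{\mu(\ell)}\,T^{\mu(\ell)}\right]\right\rangle \right\rangle_{\rm sets}. \qquad (23)$$

Performing the path average gives

$$\left\langle \exp\left[-\frac{i\eta\hat{x}}{N}\sum_{\ell=0}^{tN}\sigma^{tN-\ell}\,\boldsymbol{\xi}^{\mu}\cdot\boldsymbol{\xi}^{\mu(\ell)}\,T^{\mu(\ell)}\right]\right\rangle = \prod_{\ell=0}^{tN}\left\{\frac{1}{p}\sum_{\nu=1}^{p}\exp\left[-\frac{i\eta\hat{x}}{N}\,\sigma^{tN-\ell}\,\boldsymbol{\xi}^{\mu}\cdot\boldsymbol{\xi}^{\nu}\,T^{\nu}\right]\right\}.$$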

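After substitution of this result into (23), only a training set average remains. Once this has
been carried out, all terms in the sum over $\mu$ will be exactly equal. Anticipating this by setting
$\mu = 1$, we get

$$\hat{P}_t(\hat{x},\hat{y},\hat{T}) = \lim_{N\to\infty}\left\langle \exp\left[-i\left(\hat{x}\sigma^{tN}\mathbf{J}_0\cdot\boldsymbol{\xi}^{1} + \hat{y}\,\mathbf{B}^*\cdot\boldsymbol{\xi}^{1} + \hat{T}T^{1}\right)\right] \times \prod_{\ell=0}^{tN}\left\{\frac{1}{p}\sum_{\nu=1}^{p}\exp\left[-\frac{i\eta\hat{x}}{N}\,\sigma^{tN-\ell}\,\boldsymbol{\xi}^{1}\cdot\boldsymbol{\xi}^{\nu}\,T^{\nu}\right]\right\} \right\rangle_{\rm sets}. \qquad (24)$$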



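Consider now the product $S = \prod_{\ell=0}^{tN}[\cdots]$. The $\nu = 1$ term of the sum in square brackets needs
to be treated separately because $\boldsymbol{\xi}^1\cdot\boldsymbol{\xi}^1 = N$. For $\nu > 1$, on the other hand, the products $\boldsymbol{\xi}^1\cdot\boldsymbol{\xi}^{\nu}$
are overlaps between different input vectors and therefore only of $O(\sqrt{N})$; the rescaled overlaps
$v_{\nu} = \boldsymbol{\xi}^1\cdot\boldsymbol{\xi}^{\nu}/\sqrt{N}$ are of $O(1)$. In the sum over $\nu > 1$ in

$$\log S = \sum_{\ell=0}^{tN}\log\left\{\frac{1}{p}\exp\left(-i\eta\hat{x}\,\sigma^{tN-\ell}\,T^1\right) + \frac{1}{p}\sum_{\nu>1}\exp\left(-\frac{i\eta\hat{x}}{\sqrt{N}}\,\sigma^{tN-\ell}\,v_{\nu}T^{\nu}\right)\right\}$$

the exponential therefore has an argument of $O(N^{-1/2})$ and can be Taylor expanded. Terms
up to $O(1/N)$ (i.e., up to second order) need to be retained because of the sum over the $O(N)$
values of $\ell$, and so

$$\log S = \sum_{\ell=0}^{tN}\left\{\frac{1}{p}\left[\exp\left(-i\eta\hat{x}\,\sigma^{tN-\ell}\,T^1\right) - 1\right] - \frac{i\eta\hat{x}}{p\sqrt{N}}\,\sigma^{tN-\ell}\sum_{\nu>1}v_{\nu}T^{\nu} - \frac{\eta^2\hat{x}^2}{2Np}\,\sigma^{2tN-2\ell}\sum_{\nu>1}v_{\nu}^2\right\}$$

where contributions of $O(N^{-1/2})$ have been discarded. Transforming the sums over $\ell$ into
integrals over time (by considering appropriate Riemann sums), we then obtain

$$\log S = \chi(\hat{x}T^1) - i\eta\hat{x}\,u_1\,\frac{1-e^{-\gamma t}}{\gamma} - \frac{\eta^2\hat{x}^2}{4\gamma}\,u_2\,(1-e^{-2\gamma t}) \qquad (25)$$

where

$$\chi(w) = \frac{1}{\alpha}\int_0^t ds\ \left\{\exp\left[-i\eta w\,e^{-\gamma s}\right] - 1\right\} \qquad (26)$$

and

$$u_1 = \frac{1}{\alpha\sqrt{N}}\sum_{\nu>1}v_{\nu}T^{\nu}, \qquad u_2 = \frac{1}{\alpha N}\sum_{\nu>1}v_{\nu}^2 .$$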

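Further progress requires considering the statistics of the random variables $u_1$ and $u_2$. For
$N \to \infty$, the $v_{\nu}$ are independent Gaussian variables with zero mean and unit variance. By
the central limit theorem, $u_2$ therefore has fluctuations of $O(N^{-1/2})$ and can be replaced by
its average $\langle u_2 \rangle = 1$ in the thermodynamic limit. Similarly, because the products $v_{\nu}T^{\nu}$ are
uncorrelated for different $\nu$, $u_1$ becomes Gaussian in this limit. Using (12), its mean and variance
can be calculated as

$$\langle u_1 \rangle = \frac{1}{\alpha\sqrt{N}}\sum_{\nu>1}\langle v_{\nu}T^{\nu} \rangle = \frac{p-1}{p}\,\langle \boldsymbol{\xi}^1\cdot\boldsymbol{\xi}\ T(\boldsymbol{\xi}) \rangle_{\boldsymbol{\xi},T} = \rho\,\mathbf{B}^*\cdot\boldsymbol{\xi}^1 + O(N^{-1})$$

$$\langle (\Delta u_1)^2 \rangle = \frac{1}{\alpha^2 N}\sum_{\nu>1}\left[\langle v_{\nu}^2 \rangle - \langle v_{\nu}T^{\nu} \rangle^2\right] = \frac{1}{\alpha}\left[1 - O(N^{-1})\right].$$

We conclude that, for large $N$, $u_1 = \rho\,\mathbf{B}^*\cdot\boldsymbol{\xi}^1 + \alpha^{-1/2}u$, where $u$ is a unit variance Gaussian
random variable with mean zero. We are now in a position to average $S$ as given by (25) over
all realizations of $\{(\boldsymbol{\xi}^{\nu}, T^{\nu}),\ \nu > 1\}$, with the result

$$\langle S \rangle = \exp\left[\chi(\hat{x}T^1) - i\eta\rho\,\hat{x}\,\mathbf{B}^*\cdot\boldsymbol{\xi}^1\,\frac{1-e^{-\gamma t}}{\gamma} - \frac{\eta^2\hat{x}^2}{2\alpha\gamma^2}(1-e^{-\gamma t})^2 - \frac{\eta^2\hat{x}^2}{4\gamma}(1-e^{-2\gamma t})\right].$$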
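Inserting this into equation (24) for the characteristic function, we are left with a final average
over $\boldsymbol{\xi}^1$ and $T^1$, with the former entering only through the fields $u = \mathbf{J}_0\cdot\boldsymbol{\xi}^1$ and $y^1 = \mathbf{B}^*\cdot\boldsymbol{\xi}^1$
(from here on $u$ denotes this initial student field):

$$\hat{P}_t(\hat{x},\hat{y},\hat{T}) = \left\langle \exp\left[-i\left(\hat{x}e^{-\gamma t}u + \hat{y}y^1 + \hat{T}T^1\right) + \chi(\hat{x}T^1) - i\eta\rho\,\hat{x}y^1\,\frac{1-e^{-\gamma t}}{\gamma} - \frac{\eta^2\hat{x}^2}{2\alpha\gamma^2}(1-e^{-\gamma t})^2 - \frac{\eta^2\hat{x}^2}{4\gamma}(1-e^{-2\gamma t})\right]\right\rangle_{u,y^1,T^1}. \qquad (27)$$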



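We now observe that $T^1$ only depends on $y^1$, but not on $u$; correspondingly, $u$ is independent of
$T^1$ if $y^1$ is given. For large $N$, the two fields $u$ and $y^1$ are zero mean Gaussian random variables
with $\langle u^2 \rangle = Q_0$, $\langle uy^1 \rangle = R_0$ and $\langle (y^1)^2 \rangle = 1$. The average of the $u$-dependent factor in (27), for
given $y^1$, is therefore

$$\left\langle \exp\left(-i\hat{x}e^{-\gamma t}u\right)\right\rangle_{u|y^1} = \exp\left[-i\hat{x}e^{-\gamma t}R_0\,y^1 - \tfrac{1}{2}\hat{x}^2 e^{-2\gamma t}(Q_0 - R_0^2)\right].$$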



Inserting this into (27), and using (15, 16), one finds that the terms in the exponential which are
linear in $\hat{x}$ combine to a term proportional to $R(t)$, whereas the quadratic terms in $\hat{x}$ conspire
to give a contribution proportional to $Q(t) - R^2(t)$:

$$\hat{P}_t(\hat{x},\hat{y},\hat{T}) = \left\langle \exp\left[-i\hat{y}y^1 - i\hat{T}T^1 + \chi(\hat{x}T^1) - \tfrac{1}{2}\hat{x}^2(Q - R^2) - iR\hat{x}y^1\right]\right\rangle_{y^1,T^1}. \qquad (28)$$



Finally, we recast this result in terms of the conditional distribution of $x$, given $y$ and $T$. To
do this, first note that the distribution of $y^1$ and $T^1$ that is to be averaged over on the right
hand side of (28) is just the distribution of the teacher field $y$ and the teacher output $T$ over the
training set. We rename them appropriately and write out the definition (22) of the characteristic
function on the left hand side:

$$\sum_{T=\pm 1}\int dx\,dy\ \exp\left(-i\hat{y}y - i\hat{T}T - i\hat{x}x\right) P_t(x|y,T)\,P(y,T) = \sum_{T=\pm 1}\int dy\ \exp\left[-i\hat{y}y - i\hat{T}T + \chi(\hat{x}T) - \tfrac{1}{2}\hat{x}^2(Q - R^2) - iR\hat{x}y\right] P(y,T).$$

Equality for all $\hat{y}$ and $\hat{T}$ implies that

$$\int dx\ \exp(-i\hat{x}x)\,P_t(x|y,T) = \exp\left[\chi(\hat{x}T) - \tfrac{1}{2}\hat{x}^2(Q - R^2) - iR\hat{x}y\right]$$

and hence our final result²

$$P_t(x|y,T) = \int\frac{d\hat{x}}{2\pi}\ \exp\left[i\hat{x}(x - Ry) + \chi(\hat{x}T) - \tfrac{1}{2}\hat{x}^2(Q - R^2)\right] \qquad (29)$$



which is remarkably simple. In particular, we note that in this conditional distribution of $x$,
the noise properties enter only through the parameter $\rho$; in fact, they only affect the factor
$\exp(-i\hat{x}Ry)$, while both $Q - R^2$ and $\chi(\hat{x}T)$ are actually independent of $\rho$. Eq. (29) also shows
that the dependence of the student field on $y$ and $T$ can be written in the simple form

$$x = Ry + \Delta_1 + T\Delta_2$$

where $\Delta_1$ and $\Delta_2$ are random variables which are independent of each other and of $y$ and $T$.
Remarkably, they also do not depend on any properties of the noisy perceptron teacher: $\Delta_1$
is simply Gaussian with zero mean and variance $Q - R^2$, while the distribution of $\Delta_2$ follows
from the characteristic function $\langle \exp(-i\hat{x}\Delta_2) \rangle = \exp(\chi(\hat{x}))$. All non-Gaussian features of the
student field distribution are encoded in $\Delta_2$. Because $\chi(\cdot)$ is inversely proportional to $\alpha$, the
size of the training set, it is immediately obvious how the student field distribution recovers its
Gaussian form for $\alpha \to \infty$.

² Equation (29) can also be derived by using Fourier transforms to obtain $P_t(x,y,T)$ from (28), and then
dividing by $P(y,T)$.
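The decomposition $x = Ry + \Delta_1 + T\Delta_2$ also suggests a direct way of sampling student fields. One reading of the characteristic function $\exp(\chi(\hat{x}))$, offered here as an observation rather than as part of the original derivation, is that of a compound Poisson variable: a Poisson number of events with mean $t/\alpha$ (the number of times a given example has been presented up to time $t$), each contributing $\eta e^{-\gamma s}$ with $s$ uniform on $[0,t]$. The sketch below implements this reading; all function names and parameter values are illustrative assumptions.

```python
import math
import random


def sample_poisson(rng, mean):
    """Poisson sample by Knuth's multiplication method (fine for modest means)."""
    limit = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def sample_delta2(rng, t, alpha, eta, gamma):
    """Draw Delta_2 with <exp(-i w Delta_2)> = exp(chi(w)), chi as in eq. (26)."""
    n_events = sample_poisson(rng, t / alpha)
    return sum(eta * math.exp(-gamma * rng.uniform(0.0, t)) for _ in range(n_events))


def sample_student_field(rng, y, T, Q, R, t, alpha, eta, gamma):
    """Draw x = R*y + Delta_1 + T*Delta_2 for given teacher field y and output T.

    Q and R are assumed to be the values Q(t), R(t) from eqs. (15), (16).
    """
    delta1 = rng.gauss(0.0, math.sqrt(Q - R * R))
    return R * y + delta1 + T * sample_delta2(rng, t, alpha, eta, gamma)


if __name__ == "__main__":
    rng = random.Random(0)
    t, alpha, eta, gamma = 2.0, 2.0, 1.0, 0.5
    draws = [sample_delta2(rng, t, alpha, eta, gamma) for _ in range(20000)]
    # The sample mean should be close to eta*(1 - exp(-gamma*t))/(alpha*gamma).
    print(sum(draws) / len(draws), eta * (1.0 - math.exp(-gamma * t)) / (alpha * gamma))
```

Since $\Delta_2$ is non-negative, the term $T\Delta_2$ pushes the student field towards the sign of the training output, which is the origin of the bimodal shape of the field distribution.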

Using the fact that $y$ is Gaussian with zero mean and unit variance, the training error $E_t$
and student field probability density $P_t(x)$ follow from (29) as

$$E_t = \int dx\,Dy \sum_{T=\pm 1}\theta(-xT)\,P_t(x|y,T)\,P(T|y) \qquad (30)$$

$$P_t(x) = \int Dy \sum_{T=\pm 1}P_t(x|y,T)\,P(T|y) \qquad (31)$$

in which $Dy = (2\pi)^{-1/2}e^{-y^2/2}\,dy$. We note again that the dependence of $E_t$ and $P_t(x)$ on the
specific noise model, for a given value of $\rho$, arises solely through $P(T|y)$. We remind the
reader that this teacher output probability is given by (3),

$$P(T|y) = (1-\lambda)\,\theta(Ty) + \lambda\,\theta(-Ty)$$

for the case of output noise, while for weight noise (5) implies

$$P(T|y) = \tfrac{1}{2}\left[1 + T\,\mathrm{erf}\left(y/\sqrt{2}\,\Sigma\right)\right].$$



5 Comparison with Numerical Simulations 

From the theoretical point of view, equations (29, 30, 31) constitute the clearest expression of
our results on the joint field distribution since the dependence of the distribution on the given 
noise has been separated out in a transparent manner. However, we have found that another 
equivalent formulation can be useful from the point of view of numerical computations. This is 
detailed in the appendix. 

It will be clear that there is a large number of parameters that one could vary in order to
generate different simulation experiments with which to test our theory. Here we have to restrict
ourselves to presenting a number of representative results. Figure 2 shows, for the output noise
model, how the probability density $P_t(x)$ of the student fields $x = \mathbf{J}\cdot\boldsymbol{\xi}$ develops in time, starting
as a Gaussian distribution at $t = 0$ (following random initialization of the student weight vector)
and evolving into a highly non-Gaussian bimodal one. Figure 3 compares our predictions for the
generalization and training errors $E_g$ and $E_t$ with the results of numerical simulations (again
for teachers corrupted by output noise) for different initial conditions, $E_{g,0} = 0$ and $E_{g,0} = 0.5$,
and for different choices of the two most important parameters $\lambda$ (which controls the amount of
teacher noise) and $\alpha$ (which measures the relative size of the training set). The system is found
to have no persistent memory of its past (which will be different for some other learning rules),
the asymptotic values of $E_g$ and $E_t$ being independent of the initial student vector.³

Figure 2: Student field distribution $P_t(x)$ observed during on-line Hebbian learning with output
noise of strength $\lambda = 0.2$, at different times (from left to right: $t = 1, 2, 3, 4$), for training set size
$\alpha = 2$, learning rate $\eta = 1$, and weight decay $\gamma = 5$, with initial conditions $Q_0 = 1$ and $R_0 = 0$.
Histograms: distributions as measured in numerical simulations of an $N = 10{,}000$ system. Solid
lines: predictions of the theory.

Figure 4 shows the probability density $P_t(x)$ of the student fields $x = \mathbf{J}\cdot\boldsymbol{\xi}$ for the Gaussian
weight noise model, with effective error probability $\lambda_{\rm eff}$ chosen identical to the error probability
used to produce the corresponding graphs (figure 2) for output noise. Finally we show in figure 5
an example of a comparison between the error measures corresponding to teachers corrupted
by output noise and teachers corrupted by Gaussian weight noise, both with identical effective
output noise probability $\lambda_{\rm eff} = 0.2$. Here our theory predicts both noise types to exhibit identical
generalization errors and almost identical training errors (the small difference is discussed in
the appendix) at any time. These predictions are borne out by the corresponding numerical
simulations (carried out with networks of size $N = 10{,}000$). We conclude from these figures
that in all cases investigated the theoretical results give an extremely satisfactory account of the
numerical simulations, with finite size effects being unimportant for the system sizes considered.

Careful inspection shows that for Hebbian learning there are no true overfitting effects, not
even in the case of large $\lambda$ and small $\gamma$ (for large amounts of teacher noise, without regularization
via weight decay). Minor finite time minima of the generalization error are only found for very
short times ($t < 1$), in combination with special choices for parameters and initial conditions.
For time-dependent learning rates, however, preliminary work indicates that overfitting can
occur quite generically.

³ In the examples shown, $E_g$ is always larger than $E_t$. However, this is not true generally: we are measuring
the generalization error $E_g$ with respect to the clean teacher, whereas the (training) examples that determine
the training error $E_t$ are noisy. Thus, under certain circumstances, $E_t$ can be larger than $E_g$. A trivial example
is the case of an infinite training set ($\alpha \to \infty$) without weight decay ($\gamma = 0$). From (17), $E_g$ then tends to
zero for long times $t$, while the training error will approach $E_t = \lambda_{\rm real}$, which is nonzero for a noisy teacher. A
generalization error relative to the noisy teacher can also be defined in our problem; it turns out to be
$E_g({\rm noisy}) = \{1 - \langle \tau(y)\,\mathrm{erf}(yR\,[2(Q-R^2)]^{-1/2}) \rangle\}/2$.

Figure 3: Generalization errors (diamonds/lines) and training errors (circles/lines) as observed
during on-line Hebbian learning from a teacher corrupted by output noise, as functions of time.
Upper two graphs: noise level $\lambda = 0.2$ and training set size $\alpha \in \{0.5, 4.0\}$ (initial conditions:
upper left, $E_{g,0} = 0.5$; upper right, $E_{g,0} = 0$). Lower two graphs: $\alpha = 1$ and $\lambda \in \{0.0, 0.25\}$
(lower left, $E_{g,0} = 0.5$; lower right, $E_{g,0} = 0$). Markers: simulation results for an $N = 5{,}000$
system. Solid lines: predictions of the theory. In all cases $Q_0 = 1$, learning rate $\eta = 1$ and
weight decay $\gamma = 0.5$.

6 Discussion 

Starting from a microscopic description of Hebbian on-line learning in perceptrons with restricted
training sets, of size $p = \alpha N$ where $N$ is the number of inputs, we have developed an exact
theory in terms of macroscopic observables which has enabled us to predict the generalization
error and the training error, as well as the probability density of the student local fields, in the
limit $N \to \infty$. Our results are found to be in excellent agreement with numerical simulations,
as carried out for systems of size $N = 5{,}000$ and $N = 10{,}000$, and for various choices of the
model parameters, both for teachers corrupted by output noise and for teachers corrupted by
Gaussian weight noise. Generalizations of our calculations to scenarios involving, for instance,
time-dependent learning rates or time-dependent decay rates are straightforward. Closer analysis
of the results for these cases, and for more complicated teachers such as noisy 'reversed wedges',
may be an issue for future work.

Figure 4: Student field distribution $P_t(x)$ observed during on-line Hebbian learning with Gaussian
weight noise of effective error probability $\lambda_{\rm eff} = 0.2$ (compare eq. (18)), at different times
(from left to right: $t = 1, 2, 3, 4$), for training set size $\alpha$ = |, learning rate $\eta = 1$, and weight
decay $\gamma = 5$, with initial conditions $Q_0 = 1$ and $R_0 = 0$. Histograms: distributions as measured
in numerical simulations of an $N = 10{,}000$ system. Solid lines: predictions of the theory. See
appendix for further discussion of the close similarities with figure 2.

Although it will be clear that our present calculations cannot be extended to non-Hebbian
rules, since they ultimately rely on our ability to write down the microscopic weight vector $\mathbf{J}$ at
any time in explicit form (11), they do indeed provide a significant yardstick against which more
sophisticated and more general theories can be tested. In particular, they have already played
a valuable role in assessing the conditions under which a recent general theory of learning with
restricted training sets, based on a dynamical version of the replica formalism, is exact [?, 7].

Acknowledgments: PS is grateful to the Royal Society for financial support through a 
Dorothy Hodgkin Research Fellowship. 
A Evaluation of the Field Distribution and Training Error

In this appendix, we give alternative forms of our main results ( p9| , pC|j3l| ) for the joint field 
distribution and training error that are more suitable for numerical work. For this purpose, it 
is useful to shift attention from the noisy teacher output T to the corrupted teacher field z that 
produces it; the two are linked by T = sgn(z). This is entirely natural in the case of Gaussian 
weight noise. As discussed after eq. (|j), z then differs from the clean teacher field y by an 
independent zero-mean Gaussian variable with variance $\Sigma^2$; explicitly, one has the conditional
distribution

$$P(z|y) = \frac{1}{\sqrt{2\pi\Sigma^2}}\, e^{-(z-y)^2/2\Sigma^2} \qquad \text{(Gaussian weight noise).}$$

The case of output noise can be treated similarly, by assuming that z is identical to y with 
probability $1-\lambda$, but has the opposite sign with probability $\lambda$:

$$P(z|y) = (1-\lambda)\,\delta(z-y) + \lambda\,\delta(z+y) \qquad \text{(output noise).} \qquad (32)$$
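The two noise models can be illustrated with a small sampling sketch. It assumes, consistently with the factor $e^{-y^2/2}/\sqrt{2\pi}$ in the joint distribution, that the clean teacher field $y$ is a unit Gaussian; the function names `corrupted_field` and `flip_frequency` are hypothetical and chosen only for this example.

```python
import math
import random

def corrupted_field(y, noise, level, rng):
    """Draw the corrupted teacher field z given the clean field y.

    noise="output": z = y with probability 1 - level, z = -y otherwise (eq. 32),
                    so level plays the role of lambda.
    noise="gauss":  z = y + Sigma * xi with xi ~ N(0, 1), so level plays the role of Sigma.
    """
    if noise == "output":
        return -y if rng.random() < level else y
    if noise == "gauss":
        return y + level * rng.gauss(0.0, 1.0)
    raise ValueError("noise must be 'output' or 'gauss'")

def flip_frequency(noise, level, samples=200_000, seed=0):
    """Fraction of samples in which T = sgn(z) disagrees with sgn(y), for y ~ N(0, 1)."""
    rng = random.Random(seed)
    flips = 0
    for _ in range(samples):
        y = rng.gauss(0.0, 1.0)
        z = corrupted_field(y, noise, level, rng)
        flips += (y * z < 0.0)
    return flips / samples

if __name__ == "__main__":
    print("output noise, lambda = 0.2:", flip_frequency("output", 0.2))
    print("Gaussian noise, Sigma = 1 :", flip_frequency("gauss", 1.0))
```

For output noise the measured frequency of sign flips should approach $\lambda$. For Gaussian noise, $y$ and $z$ are jointly Gaussian with correlation $1/\sqrt{1+\Sigma^2}$, so the frequency should approach $\pi^{-1}\arccos(1/\sqrt{1+\Sigma^2})$, which is $1/4$ for $\Sigma = 1$.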

We now consider the joint distribution $P_t(x,y,z)$. It can be derived by complete analogy with
the calculation in section ^[ For the conditional distribution of x, one finds that 

$$P_t(x|y,z) = P_t(x|y,\mathrm{sgn}(z)).$$

Intuitively, this follows from the fact that during learning, the student only ever sees the noisy 
teacher output sgn(z), but not the corrupted field z itself; the student field x can therefore 
depend on z only through sgn(z). Multiplying by the joint distribution of y and z, and using 
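the result (29), one thus finds, for the case of output noise,

$$P_t(x,y,z) = \left[(1-\lambda)\,\delta(z-y) + \lambda\,\delta(z+y)\right] \frac{e^{-\frac{1}{2}y^2}}{\sqrt{2\pi}} \int\!\frac{d\hat{x}}{2\pi}\; e^{-\frac{1}{2}\hat{x}^2(Q-R^2) + i\hat{x}(x-Ry) + \chi(\hat{x}\,\mathrm{sgn}(z))}$$

with the marginal distribution

$$P_t(x,z) = \frac{e^{-\frac{1}{2}z^2}}{\sqrt{2\pi}} \int\!\frac{d\hat{x}}{2\pi}\; e^{-\frac{1}{2}\hat{x}^2(Q-R^2) + i\hat{x}x + \chi(\hat{x}\,\mathrm{sgn}(z))} \left[(1-\lambda)\,e^{-i\hat{x}zR} + \lambda\,e^{i\hat{x}zR}\right]. \qquad (33)$$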



The corresponding expressions in the case of Gaussian weight noise read 

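$$P_t(x,y,z) = \frac{1}{2\pi\Sigma} \int\!\frac{d\hat{x}}{2\pi}\; e^{-\frac{1}{2}\hat{x}^2(Q-R^2) + i\hat{x}(x-Ry) + \chi(\hat{x}\,\mathrm{sgn}(z)) - [z^2 - 2zy + y^2(1+\Sigma^2)]/(2\Sigma^2)}$$

and

$$P_t(x,z) = \frac{e^{-\frac{1}{2}z^2/(1+\Sigma^2)}}{\sqrt{2\pi(1+\Sigma^2)}} \int\!\frac{d\hat{x}}{2\pi}\; e^{-\frac{1}{2}\hat{x}^2[Q - R^2/(1+\Sigma^2)] + i\hat{x}[x - Rz/(1+\Sigma^2)] + \chi(\hat{x}\,\mathrm{sgn}(z))}. \qquad (34)$$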

In both cases, the training error and the probability distribution of the student field x are then 
determined by 

$$E_{\rm tr} = \int\! dx\, dz\; \theta(-xz)\, P_t(x,z), \qquad P_t(x) = \int\! dz\; P_t(x,z)$$

respectively. For a numerical computation of these two quantities, it is imperative to further 
reduce the number of integrations analytically, which turns out to be possible. In the following, 
we drop the time subscript t on all distributions to save notation. 
First we deal with the case of output noise. In the marginal distribution (33), we make the
change of variable $\hat{x} = k\,\mathrm{sgn}(z)$ to get

$$P(x,z) = \frac{e^{-\frac{1}{2}z^2}}{\sqrt{2\pi}} \int\!\frac{dk}{2\pi}\; e^{-\frac{1}{2}k^2(Q-R^2) + \chi(k) + ikx\,\mathrm{sgn}(z)} \left[(1-\lambda)\,e^{-ik|z|R} + \lambda\,e^{ik|z|R}\right].$$

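The training error is

$$E_{\rm tr} = \int\! dx\, dz\; P(x,z)\,\theta(-xz) = \int_0^\infty\! dx\, \left[P_+(-x) + P_-(x)\right]$$

where

$$P_\pm(x) = \int\! dz\; P(x,z)\,\theta(\pm z) = \frac{1}{2}\int\!\frac{dk}{2\pi}\int\! Dz\; e^{-\frac{1}{2}k^2(Q-R^2) + \chi(k) \pm ikx} \left\{(1-\lambda)\,e^{-ik|z|R} + \lambda\,e^{ik|z|R}\right\}. \qquad (35)$$

We see that $P_+(x) = P_-(-x) = \Pi(x)$. In terms of $\Pi(x)$ we have the formulae

$$P(x) = \Pi(x) + \Pi(-x), \qquad E_{\rm tr} = 2\int_0^\infty\! dx\; \Pi(-x). \qquad (36)$$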

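The function $\Pi(x)$ can be further simplified by decomposing $\chi$ into its real ($\chi_{\rm r} = \mathrm{Re}(\chi)$) and
imaginary ($\chi_{\rm i} = \mathrm{Im}(\chi)$) parts:

$$\begin{aligned}
\Pi(x) &= \frac{1}{2}\int\!\frac{dk}{2\pi}\int\! Dz\; e^{-\frac{1}{2}k^2(Q-R^2) + \chi(k) + ikx} \left\{(1-\lambda)\,e^{-ik|z|R} + \lambda\,e^{ik|z|R}\right\} \\
&= \int\!\frac{dk}{4\pi}\int\! Dz\; e^{-\frac{1}{2}k^2(Q-R^2) + \chi_{\rm r}(k)} \left\{(1-\lambda)\cos[\chi_{\rm i}(k) + k(x - R|z|)] + \lambda\cos[\chi_{\rm i}(k) + k(x + R|z|)]\right\} \\
&= \int\!\frac{dk}{4\pi}\; e^{-\frac{1}{2}Qk^2 + \chi_{\rm r}(k)} \left\{\cos[\chi_{\rm i}(k) + kx] + (1-2\lambda)\sin[\chi_{\rm i}(k) + kx]\, G(kR)\right\} \qquad (37)
\end{aligned}$$

in which

$$G(\Delta) = e^{\frac{1}{2}\Delta^2}\int\! Dz\; \sin(\Delta|z|) = \Delta\sqrt{\frac{2}{\pi}}\;\, {}_1F_1\!\left(\tfrac{1}{2};\tfrac{3}{2};\tfrac{1}{2}\Delta^2\right) \qquad (38)$$

and ${}_1F_1(\ldots)$ is the degenerate hypergeometric function (see f|], page 1058). From equation (36)
we now immediately obtain our final result for the student field distribution:

$$P(x) = \int\!\frac{dk}{2\pi}\; e^{-\frac{1}{2}Qk^2 + \chi_{\rm r}(k)} \cos(kx) \left\{\cos[\chi_{\rm i}(k)] + (1-2\lambda)\,G(kR)\sin[\chi_{\rm i}(k)]\right\}. \qquad (39)$$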
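The two forms of $G(\Delta)$ in (38) can be checked against each other numerically. The sketch below is only an illustration: it assumes that $Dz$ denotes the unit Gaussian measure $(2\pi)^{-1/2}e^{-z^2/2}dz$, sums the power series of ${}_1F_1(\frac{1}{2};\frac{3}{2};\cdot)$ directly, and evaluates the defining integral by Simpson's rule. The names `G_series` and `G_quadrature` are hypothetical.

```python
import math

def G_series(delta, terms=80):
    """G(Delta) = Delta * sqrt(2/pi) * 1F1(1/2; 3/2; Delta^2/2), summed term by term."""
    x = 0.5 * delta * delta
    total, power = 0.0, 1.0             # power holds x**n / n!
    for n in range(terms):
        total += power / (2 * n + 1)    # (1/2)_n / (3/2)_n = 1 / (2n + 1)
        power *= x / (n + 1)
    return delta * math.sqrt(2.0 / math.pi) * total

def G_quadrature(delta, zmax=12.0, n=4000):
    """G(Delta) = exp(Delta^2/2) * int Dz sin(Delta*|z|), by Simpson's rule on [0, zmax]."""
    h = zmax / n
    f = lambda z: math.exp(-0.5 * z * z) * math.sin(delta * z)
    s = f(0.0) + f(zmax)
    for j in range(1, n):
        s += (4 if j % 2 else 2) * f(j * h)
    half_line = s * h / 3.0
    # sin(Delta*|z|) is even in z, so the full Gaussian average is twice the half-line part
    return math.exp(0.5 * delta * delta) * 2.0 * half_line / math.sqrt(2.0 * math.pi)

if __name__ == "__main__":
    for d in (0.3, 1.3, 2.5):
        print(d, G_series(d), G_quadrature(d))
```

The plain series is adequate for moderate arguments; for large $|\Delta|$ the function grows like $e^{\Delta^2/2}$, so in practice one would combine it with the Gaussian factor $e^{-\frac{1}{2}Qk^2}$ before evaluating.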

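To further simplify the expression (36) for the training error, we write

$$E_{\rm tr} = \lim_{L\to\infty} 2\int_{-L}^{0}\! dx\; \Pi(x) = 2\lim_{L\to\infty} I(L)$$

where, from (37),

$$I(L) = \int\!\frac{dk}{4\pi}\; e^{-\frac{1}{2}Qk^2 + \chi_{\rm r}(k)} \left\{\int_{-L}^{0}\! dx\, \cos[\chi_{\rm i}(k) + kx] + (1-2\lambda)\,G(kR)\int_{-L}^{0}\! dx\, \sin[\chi_{\rm i}(k) + kx]\right\}.$$

Thus

$$\begin{aligned}
I(\infty) = &-\int\!\frac{dk}{4\pi k}\; e^{-\frac{1}{2}Qk^2 + \chi_{\rm r}(k)} \left\{(1-2\lambda)\,G(kR)\cos[\chi_{\rm i}(k)] - \sin[\chi_{\rm i}(k)]\right\} \\
&+ \lim_{L\to\infty}\int\!\frac{dk}{4\pi k}\; e^{-\frac{1}{2}Qk^2 + \chi_{\rm r}(k)} \left\{\sin[kL - \chi_{\rm i}(k)] + (1-2\lambda)\,G(kR)\cos[kL - \chi_{\rm i}(k)]\right\}. \qquad (40)
\end{aligned}$$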

[Plot: training error $E_{\rm tr}$ (vertical axis, ticks from 0.0953 to 0.0959) against time $t$ (horizontal axis, ticks from 10 to 50); curves labelled "output noise" and "input noise".]

Figure 6: Characteristic example of theoretical predictions for the training error $E_{\rm tr}$ for two
noisy teachers with identical effective error probability $\lambda_{\rm eff} = 0.2$. Dashed line: output noise;
solid line: Gaussian weight noise. Parameters: $\alpha = \gamma = 0.5$, $Q_0 = \eta = 1$, $E_{\rm g}^0 = 0.5$.



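The $L$-dependent integral in (40) can be expressed as a sum of two integrals, which we consider
separately. In the first part, we replace $k$ by $k/L$ and obtain

$$\lim_{L\to\infty}\int\!\frac{dk}{4\pi k}\; e^{-\frac{1}{2}Qk^2 + \chi_{\rm r}(k)} \sin[kL - \chi_{\rm i}(k)] = \lim_{L\to\infty}\int\!\frac{dk}{4\pi k}\; e^{-\frac{1}{2}Q(k/L)^2 + \chi_{\rm r}(k/L)} \sin[k - \chi_{\rm i}(k/L)] = \int\!\frac{dk}{4\pi k}\,\sin(k) = \frac{1}{4}.$$

Secondly we need to consider the behaviour of

$$\int\!\frac{dk}{4\pi k}\; e^{-\frac{1}{2}Qk^2 + \chi_{\rm r}(k)} \cos[kL - \chi_{\rm i}(k)]\, G(kR) \qquad (41)$$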



in the limit $L \to \infty$. We set $u = kR$ and note that, because $Q > R^2$, one has $e^{-\frac{1}{2}Qk^2} \le e^{-\frac{1}{2}u^2}$;
furthermore,

$$\left|\int\! Dz\; |z|\, \frac{\sin(|uz|)}{|uz|}\right| \le \int\! Dz\; |z|.$$

Finally, $\chi(k)$ is independent of $L$ and is bounded as a function of $k$; in fact, from (26), $|\chi(k)| \le 2\alpha^{-1}t$.
It follows by an application of the Riemann-Lebesgue Lemma (see e.g. [10]) that the
integral (41) tends to zero as $L \to \infty$. We conclude that for output noise the training error is
given by

$$E_{\rm tr} = \frac{1}{2} - \int\!\frac{dk}{2\pi k}\; e^{-\frac{1}{2}Qk^2 + \chi_{\rm r}(k)} \left\{(1-2\lambda)\,G(kR)\cos[\chi_{\rm i}(k)] - \sin[\chi_{\rm i}(k)]\right\} \qquad (42)$$

where $G(\ldots)$ is defined by (38).
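The behaviour of the two $L$-dependent integrals in (40) can be made concrete in the simplest special case $\chi \equiv 0$, which is an assumption made purely for illustration (the actual $\chi$ is defined in the main text). In that case the sine integral equals $\frac{1}{4}\,\mathrm{erf}(L/\sqrt{2Q})$, a standard Gaussian integral, and so tends to $\frac{1}{4}$, while the cosine integral (41) decays rapidly with $L$. The function names below are hypothetical.

```python
import math

def G(delta, terms=80):
    """G(Delta) of eq. (38) via the power series of 1F1(1/2; 3/2; Delta^2/2)."""
    x = 0.5 * delta * delta
    total, power = 0.0, 1.0
    for n in range(terms):
        total += power / (2 * n + 1)
        power *= x / (n + 1)
    return delta * math.sqrt(2.0 / math.pi) * total

def boundary_terms(L, Q, R, kmax=10.0, n=8000):
    """Midpoint-rule estimates of the two L-dependent integrals in (40), for chi = 0.

    Returns (sine part, cosine part); the cosine part is the integral (41).
    """
    h = 2.0 * kmax / n
    sine_part = 0.0
    cosine_part = 0.0
    for j in range(n):
        k = -kmax + (j + 0.5) * h          # midpoints never land on k = 0
        w = math.exp(-0.5 * Q * k * k) / (4.0 * math.pi * k)
        sine_part += w * math.sin(k * L)
        cosine_part += w * math.cos(k * L) * G(k * R)
    return sine_part * h, cosine_part * h

if __name__ == "__main__":
    for L in (1.0, 5.0, 20.0):
        s, c = boundary_terms(L, Q=1.0, R=0.5)
        print(f"L = {L:5.1f}   sine part = {s:.10f}   cosine part = {c:.3e}")
```

With $Q = 1$ and $R = 0.5$ the sine part should approach $0.25$ and the cosine part should shrink towards zero as $L$ grows, in line with the Riemann-Lebesgue argument above.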
The procedure for Gaussian weight noise is similar to that of output noise. We start from
equation (34) and define

$$\tilde{R} = R/\sqrt{1+\Sigma^2}.$$

Upon defining $\hat{x} = k\,\mathrm{sgn}(z)$ in (34), rescaling the integration variable $z \to z\sqrt{1+\Sigma^2}$, and continuing in the same
notation as for output noise, we find

$$P_\pm(x) = \frac{1}{2}\int\!\frac{dk}{2\pi}\int\! Dz\; e^{-\frac{1}{2}k^2(Q-\tilde{R}^2) + \chi(k) \pm ikx - ik\tilde{R}|z|}. \qquad (43)$$

Since (43) can be obtained from (35) by putting $\lambda \to 0$ and $R \to \tilde{R}$, we immediately obtain for
the student field distribution and the training error, respectively [see equations (39) and (42)],

$$P(x) = \int\!\frac{dk}{2\pi}\; e^{-\frac{1}{2}Qk^2 + \chi_{\rm r}(k)} \cos(kx) \left\{\cos[\chi_{\rm i}(k)] + G(k\tilde{R})\sin[\chi_{\rm i}(k)]\right\} \qquad (44)$$

$$E_{\rm tr} = \frac{1}{2} - \int\!\frac{dk}{2\pi k}\; e^{-\frac{1}{2}Qk^2 + \chi_{\rm r}(k)} \left\{G(k\tilde{R})\cos[\chi_{\rm i}(k)] - \sin[\chi_{\rm i}(k)]\right\} \qquad (45)$$
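The single remaining $k$-integration in (39), (42), (44) and (45) is straightforward to carry out numerically. The following is a minimal sketch under stated assumptions, not a reference implementation: the functions `chi_r` and `chi_i` must be supplied by the caller (their definition is in the main text and is not reproduced here), the Gaussian case is parametrised by the effective error probability through $\tilde{R} = (1-2\lambda)R$, and all function names, the cutoff `kmax` and the grid size `n` are choices made only for this example.

```python
import math

def G(delta, terms=80):
    """G(Delta) of eq. (38) via the power series of 1F1(1/2; 3/2; Delta^2/2)."""
    x = 0.5 * delta * delta
    total, power = 0.0, 1.0
    for n in range(terms):
        total += power / (2 * n + 1)
        power *= x / (n + 1)
    return delta * math.sqrt(2.0 / math.pi) * total

def _weights(Q, chi_r, kmax, n):
    """Midpoint grid in k, with the common factor exp(-Q k^2/2 + chi_r(k)) dk."""
    h = 2.0 * kmax / n
    for j in range(n):
        k = -kmax + (j + 0.5) * h          # midpoints never land on k = 0
        yield k, h * math.exp(-0.5 * Q * k * k + chi_r(k))

def g_term(k, R, lam, noise):
    """(1-2 lam) G(kR) for output noise; G((1-2 lam) kR) for Gaussian weight noise."""
    if noise == "output":
        return (1.0 - 2.0 * lam) * G(k * R)
    if noise == "gauss":
        return G((1.0 - 2.0 * lam) * k * R)
    raise ValueError("noise must be 'output' or 'gauss'")

def field_density(x, Q, R, lam, chi_r, chi_i, noise="output", kmax=10.0, n=4000):
    """Student field distribution P(x), eq. (39) or (44)."""
    total = 0.0
    for k, w in _weights(Q, chi_r, kmax, n):
        c = chi_i(k)
        total += w * math.cos(k * x) * (math.cos(c) + g_term(k, R, lam, noise) * math.sin(c))
    return total / (2.0 * math.pi)

def training_error(Q, R, lam, chi_r, chi_i, noise="output", kmax=10.0, n=4000):
    """Training error E_tr, eq. (42) or (45)."""
    total = 0.0
    for k, w in _weights(Q, chi_r, kmax, n):
        c = chi_i(k)
        total += w * (g_term(k, R, lam, noise) * math.cos(c) - math.sin(c)) / k
    return 0.5 - total / (2.0 * math.pi)

if __name__ == "__main__":
    zero = lambda k: 0.0                   # chi = 0, used only as a consistency check
    print(training_error(1.0, 0.5, 0.2, zero, zero, "output"))
    print(training_error(1.0, 0.5, 0.2, zero, zero, "gauss"))
```

A convenient consistency check is the case $\chi \equiv 0$, where the integrals can be done in closed form: $P(x)$ reduces to a zero-mean Gaussian of variance $Q$, and the training error reduces to $\frac{1}{2} - (1-2\lambda)\,\pi^{-1}\arcsin(R/\sqrt{Q})$ for output noise and $\frac{1}{2} - \pi^{-1}\arcsin[(1-2\lambda)R/\sqrt{Q}]$ for Gaussian weight noise. The cutoff `kmax` has to be large enough that $e^{-\frac{1}{2}(Q-R^2)k^2}$ is negligible at the boundary, and the plain series used for $G$ is only suitable for moderate values of $kR$.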

In particular, we can now calculate the student field distribution and the training error for both 
output noise and Gaussian weight noise, with noise levels such that in both cases $\lambda_{\rm eff} = \lambda$. This
guarantees that, at any time, $Q$, $R$ and $E_{\rm g}$ will have the same values in both cases; it also implies
$\tilde{R} = R/\sqrt{1+\Sigma^2} = R(1-2\lambda)$. We then obtain from (39, 42, 44, 45) very similar expressions:

$$P^{\rm out}(x) = \int\!\frac{dk}{2\pi}\; e^{-\frac{1}{2}Qk^2 + \chi_{\rm r}(k)} \cos(kx) \left\{\cos[\chi_{\rm i}(k)] + (1-2\lambda)\,G(kR)\sin[\chi_{\rm i}(k)]\right\}$$

$$P^{\rm Gauss}(x) = \int\!\frac{dk}{2\pi}\; e^{-\frac{1}{2}Qk^2 + \chi_{\rm r}(k)} \cos(kx) \left\{\cos[\chi_{\rm i}(k)] + G[(1-2\lambda)kR]\sin[\chi_{\rm i}(k)]\right\}$$

and

$$E_{\rm tr}^{\rm out} = \frac{1}{2} - \int\!\frac{dk}{2\pi k}\; e^{-\frac{1}{2}Qk^2 + \chi_{\rm r}(k)} \left\{(1-2\lambda)\,G(kR)\cos[\chi_{\rm i}(k)] - \sin[\chi_{\rm i}(k)]\right\}$$

$$E_{\rm tr}^{\rm Gauss} = \frac{1}{2} - \int\!\frac{dk}{2\pi k}\; e^{-\frac{1}{2}Qk^2 + \chi_{\rm r}(k)} \left\{G[(1-2\lambda)kR]\cos[\chi_{\rm i}(k)] - \sin[\chi_{\rm i}(k)]\right\}$$

Provided parameters are chosen such that the effective error probabilities are identical, the
differences between output noise and Gaussian weight noise are restricted to the positioning of
the factor $1-2\lambda$ relative to the function $G(\ldots)$, with manifestly identical expressions for $\lambda = 0$
and $\lambda = \frac{1}{2}$ (as it should be). As a result, the curves for field distributions and training
errors are found to be almost identical; figure 6 shows a typical example.
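The only place where the two noise models differ is the term $(1-2\lambda)\,G(u)$ versus $G[(1-2\lambda)u]$, with $u = kR$. A short sketch of how one might tabulate this difference is given below; the helper name `placement_gap` is hypothetical, and the series for $G$ is the same illustrative evaluation of (38) as before.

```python
import math

def G(delta, terms=80):
    """G(Delta) of eq. (38) via the power series of 1F1(1/2; 3/2; Delta^2/2)."""
    x = 0.5 * delta * delta
    total, power = 0.0, 1.0
    for n in range(terms):
        total += power / (2 * n + 1)
        power *= x / (n + 1)
    return delta * math.sqrt(2.0 / math.pi) * total

def placement_gap(u, lam):
    """(1 - 2 lam) G(u) - G((1 - 2 lam) u): output-noise term minus Gaussian-noise term."""
    c = 1.0 - 2.0 * lam
    return c * G(u) - G(c * u)

if __name__ == "__main__":
    for lam in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5):
        row = "  ".join(f"{placement_gap(u, lam):9.5f}" for u in (0.5, 1.0, 2.0))
        print(f"lambda = {lam:.1f}:  {row}")
```

The gap vanishes identically at $\lambda = 0$ and $\lambda = \frac{1}{2}$, as stated above. At intermediate $\lambda$ it is nonzero and grows with $|u|$, but in the integrals it is always multiplied by the Gaussian factor $e^{-\frac{1}{2}Qk^2}$.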






